DataQuest Capstone: Exploratoray Data Analysis of Reader Engagement By Book Genre¶

Hypothesis and Approach¶

The goal of this project is to make a recommendation for the launch of an online book group. We want to know which types of books tend to lead to the most engaged readers. The books should both appeal to a large volume of readers and tend to drive discussion among readers, since these qualities would support the group's success.

From an informal poll of my network I've formed a hypothesis that fiction books would lend themselves better to an online book group than non-fiction. Now I want to examine a large dataset that will either lead me to support or reject that hypothesis. I also want to learn if there are more specific categories of books that would lead to more interest.

I'm using a dataset from Kaggle that compiles data for 100k books scraped from Goodreads. Goodreads is an extensive database of books that allows users to rate and review books, and to generate reading lists. As of 2022 the site reported 90 million registered members so we can assume that it reaches a broad audience of book readers.

We're making a couple of important assumptions:

  • That the dataset of 100k books from Kaggle is representative of the total population of books on the site.

  • That the behavior of users on Goodreads is a good proxy for readers' inclination to engage with an online book group. Specifically, we'll be looking at the number of star ratings the book received as an indication of how widely read it is, and the number of reviews users wrote, as an indication of how much discussion the book elicited.

Data Exploration¶

First we'll import the Python libraries, then load the csv file into a dataframe, view a few rows, and generate some descriptive stats about the data. Next we'll determine which columns to focus on and what steps we should take to clean the data.

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
data = pd.read_csv('GoodReads_100k_books.csv')
In [2]:
#View how many rows, columns, and non-null values exist in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   author        100000 non-null  object 
 1   bookformat    96772 non-null   object 
 2   desc          93228 non-null   object 
 3   genre         89533 non-null   object 
 4   img           96955 non-null   object 
 5   isbn          85518 non-null   object 
 6   isbn13        88565 non-null   object 
 7   link          100000 non-null  object 
 8   pages         100000 non-null  int64  
 9   rating        100000 non-null  float64
 10  reviews       100000 non-null  int64  
 11  title         99999 non-null   object 
 12  totalratings  100000 non-null  int64  
dtypes: float64(1), int64(3), object(9)
memory usage: 9.9+ MB
In [3]:
#Generate descriptive stats about the numeric columns
data.describe()
Out[3]:
pages rating reviews totalratings
count 100000.000000 100000.000000 100000.000000 1.000000e+05
mean 255.010240 3.833055 181.528450 2.990764e+03
std 367.913582 0.621237 1449.451229 3.635338e+04
min 0.000000 0.000000 0.000000 0.000000e+00
25% 135.000000 3.660000 3.000000 3.100000e+01
50% 240.000000 3.910000 15.000000 1.460000e+02
75% 336.000000 4.140000 67.000000 7.440000e+02
max 70000.000000 5.000000 158776.000000 3.819326e+06
In [4]:
#View the first three rows
data.head(3)
Out[4]:
author bookformat desc genre img isbn isbn13 link pages rating reviews title totalratings
0 Laurence M. Hauptman Hardcover Reveals that several hundred thousand Indians ... History,Military History,Civil War,American Hi... https://i.gr-assets.com/images/S/compressed.ph... 002914180X 9.78E+12 https://goodreads.com/book/show/1001053.Betwee... 0 3.52 5 Between Two Fires: American Indians in the Civ... 33
1 Charlotte Fiell,Emmanuelle Dirix Paperback Fashion Sourcebook - 1920s is the first book i... Couture,Fashion,Historical,Art,Nonfiction https://i.gr-assets.com/images/S/compressed.ph... 1906863482 9.78E+12 https://goodreads.com/book/show/10010552-fashi... 576 4.51 6 Fashion Sourcebook 1920s 41
2 Andy Anderson Paperback The seminal history and analysis of the Hungar... Politics,History https://i.gr-assets.com/images/S/compressed.ph... 948984147 9.78E+12 https://goodreads.com/book/show/1001077.Hungar... 124 4.15 2 Hungary 56 26

These are the findings from the initial exploration of the data:

  • There are 100k rows and 13 columns in the dataset.

  • The genre column contains information about whether the book is fiction or non-fiction as well as other genres.

  • Numeric columns include pages, rating, reviews, and totalratings. All of these may be useful in our analysis, except for pages, because we don't care how long the book is.

  • Books tend to have far more ratings than reviews. It likely takes more effort to write out a review than to leave a rating. Both of these columns are important for our analysis.

  • The star rating indicates how satisfied users are with the book, which could also be important. The max rating is 5, and the average for all books is 3.8, which is fairly high. It's possible users are more inclined to rate books they liked, or our dataset is skewed. This may be less important to us than the number of reviews and ratings, but we will consider it in the analysis.

  • 25% of the dataset had fewer than 31 ratings. This is important because if a book had very few ratings the average rating might be easily skewed.

Let's look at the total and unique values in the non-numeric columns and determine how we should clean the data.

In [5]:
#View the total and unique values for the non-numeric columns
data.describe(exclude="number").iloc[0:2,:]
Out[5]:
author bookformat desc genre img isbn isbn13 link title
count 100000 96772 93228 89533 96955 85518 88565 100000 99999
unique 68767 202 92499 72129 96955 85518 725 100000 97588

These are the findings around data cleaning:

  • While the numeric columns don't have missing values, most of the non-numeric columns have some missing values.

  • There are 72k unique values in the genre column because each book has multiple genres contained in a string, so these will need to be broken apart for them to be useful. The genres are generated from the shelf names of users who add the the book to their shelves. We can expect there to be overlap in the shelf names, and one book may be categorized under many different genres. A fair amount of cleanup and filtering will be needed here.

  • There are repeated values for author, but this is to be expected since authors can publish multiple books.

  • There are only 725 unique values for ISBN13. ISBN13 should be unique to the title and format which is important for determining whether there are duplicates in this dataset. Therefore, ISBN13 probably won't be very useful, but the ISBN column (which is the 10-digit version of the ISBN) had far more unique values (86k) and no repeat values although there are some missing values.

  • There are a lot of book formats (202). While this isn't a primary focus of our analysis it could prove useful, so we'll try to clean these up.

  • There are 2,411 duplicate titles and one missing title. We should see if these are truly duplicates or could they be multiple formats of a single book.

Data Cleaning¶

Now that we know how to clean the data, we'll drop the columns we don't need, reorder the remaining columns, and then see if there are any duplicate titles. Because we have some missing values in the ISBN column, we'll consider an entry to be potentially duplicated if the title, author, ISBN and format all match.

In [6]:
#Drop unneeded columns
data_clean = data.copy()
data_clean.drop(['desc', 'img', 'link', 'pages', 'isbn13'], axis=1, inplace=True)

#Reorder remaining columns
data_clean = data_clean[['title', 'author','bookformat', 'isbn', 'genre', 'rating', 'totalratings', 'reviews']]

#Find potential duplicate books
titles_duplicated = data_clean[data_clean.duplicated(['title', 'isbn', 'author', 'bookformat'], keep=False)]
titles_duplicated.head(10)
Out[6]:
title author bookformat isbn genre rating totalratings reviews
5978 Kiroileva siili Milla Paloniemi Hardcover NaN Sequential Art,Comics,European Literature,Finn... 3.58 124 5
6055 Kiroileva siili Milla Paloniemi Hardcover NaN Sequential Art,Comics,European Literature,Finn... 3.64 75 5
22069 Savage Garden Lee Hyeon-sook Paperback NaN Sequential Art,Manga,Manga,Manhwa,Sequential A... 4.06 138 10
22072 Savage Garden Lee Hyeon-sook Paperback NaN Sequential Art,Manga,Manga,Manhwa,Sequential A... 4.04 89 3
22073 Savage Garden Lee Hyeon-sook Paperback NaN Sequential Art,Manga,Manga,Manhwa,Sequential A... 4.07 86 3
44987 Honggane Mi-Ri Hwang NaN NaN Sequential Art,Manga,Manga,Manhwa,Romance,Mang... 4.02 123 2
44991 Honggane Mi-Ri Hwang NaN NaN Sequential Art,Manga,Manga,Manhwa,Romance,Mang... 4.13 80 1
44993 Honggane Mi-Ri Hwang NaN NaN Sequential Art,Manga,Manga,Manhwa,Romance,Mang... 4.13 77 1
44995 Honggane Mi-Ri Hwang NaN NaN Sequential Art,Manga,Manga,Manhwa,Romance,Mang... 4.11 79 1
44996 Honggane Mi-Ri Hwang NaN NaN Sequential Art,Manga,Manga,Manhwa,Romance,Mang... 4.04 73 1

These look like duplicate titles, but they have different numbers of ratings and reviews. Perhaps they were scraped at different times. Let's remove the duplicates, and keep the entry with the most ratings.

In [7]:
#Remove the duplicate values, keeping the value with the most ratings
data_clean = data_clean.sort_values(by = ['totalratings'], ascending = False)
data_clean = data_clean.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn'], keep='first')

Which book is missing a title?

In [8]:
data_clean[data_clean['title'].isnull()]
Out[8]:
title author bookformat isbn genre rating totalratings reviews
54953 NaN Jacqui Malpass NaN NaN Diary,Journaling 3.33 9 3

Since neither the title nor the ISBN is available on Goodreads we'll remove this from the dataset.

In [9]:
#Remove the book with no title from the dataset
data_clean = data_clean.dropna(subset=['title'])

#View how many rows are left in the dataset
data_clean.shape
Out[9]:
(99938, 8)

These two steps eliminated 62 rows. Now let's take a look at what the book formats are.

In [10]:
#Display the number of books for each format and display the top 20
data_clean.groupby(['bookformat']).size().sort_values(ascending=False).head(20)
Out[10]:
bookformat
Paperback                53845
Hardcover                27579
ebook                     5328
Kindle Edition            4565
Mass Market Paperback     3024
Unknown Binding            449
Nook                       395
Audio CD                   318
Board Book                 164
Spiral-bound               137
Audiobook                  105
Library Binding             69
Board book                  65
paperback                   38
Audio                       35
Trade Paperback             34
Audio Cassette              29
Leather Bound               29
Comics                      24
Cards                       22
dtype: int64

We'll clean these up by changing the case to lower for consistency, grouping some of the values together (like "mass market paperback" and "trade paperback") and grouping some of the little-used formats into an "other" category.

In [11]:
#Clean the bookformat column
data_clean['bookformat'] = data_clean['bookformat'].str.lower()
data_clean['bookformat'] = data_clean['bookformat'].replace(['kindle edition', 'nook'],'ebook')
data_clean['bookformat'] = data_clean['bookformat'].replace(['mass market paperback', 'softcover', 'trade paperback'],'paperback')
data_clean['bookformat'] = data_clean['bookformat'].replace(['hardback'],'hardcover')
data_clean['bookformat'] = data_clean['bookformat'].replace(['audio cd', 'audio', 'audio cassette', 'audio play','audible audio'],'audiobook')
data_clean['bookformat'] = np.where(data_clean['bookformat'].isin(['paperback','hardcover','ebook','audiobook']), data_clean['bookformat'], 'other')

#Display the total counts and normalized counts for the cleaned formats
formats = pd.concat([data_clean['bookformat'].value_counts(), data_clean['bookformat'].value_counts(normalize=True)], axis=1, keys=('counts','percentage'))

#Format the numeric columns
formats.style.format({'counts':'{:,.0f}',
                      'percentage':'{:.0%}'})
Out[11]:
  counts percentage
paperback 56,949 57%
hardcover 27,613 28%
ebook 10,288 10%
other 4,577 5%
audiobook 511 1%

We can see that paperbacks are by far the most common format on the platform, but hardcovers are significant. Ebooks, Audiobooks, and other formats are far less common.

Now we'll move on the cleaning the genre column. First, we have a number of null values to consider.

In [12]:
#See how many null values there are in the genre column
data_clean['genre'].isnull().sum()
Out[12]:
10467

There are 10,467 rows that don't have the genre filled in at all, so we don't know if those books are fiction or non-fiction. We'll fill these rows with "no genre" so they will be considered as a group in the final analysis. If we see high reviews and high ratings for this group it may be worth doing further exploration to see if we can get genre information for these titles.

In [14]:
#Fill null values in the genre column with "no genre"
data_clean['genre'].fillna("no genre", inplace = True)

Next, we'll do the following steps:

  • Change the genre column to all lowercase to eliminate any duplication due to inconsistent capitalization.

  • Create a column called new_genre that copies the genre column.

  • Clean up some instances in new_genre where "non fiction" includes a space.

  • Create a new column called supergenre that specifies whether the title is fiction or non-fiction. If "nonfiction" appears in the new genre column it will be labeled "non-fiction" with a hyphen.

  • Calculate how many books are non-fiction, or "no value."
In [15]:
#Change the genre column to lowercase
data_clean['genre'] = data_clean['genre'].str.lower()

#Add a column called new_genre which is a copy of genre
data_clean['new_genre'] = data_clean['genre']

#Clean up the variations on "nonfiction"
data_clean['new_genre'] = data_clean['new_genre'].str.replace('non fiction', 'nonfiction')

#Create a new column called supergenre and set the value to "no value"
data_clean['supergenre'] = 'no value'

#When the new_genre column contains "nonfiction" set the supergenre value to "non-fiction"
data_clean.loc[data_clean['new_genre'].str.contains('nonfiction'),'supergenre'] = 'non-fiction'

#Count up the books classified as "no value" and "nonfiction"
data_clean['supergenre'].value_counts()
Out[15]:
no value       70459
non-fiction    29479
Name: supergenre, dtype: int64

Now add some logic so if "fiction" appears anywhere in new_genre and is not preceded by "non" the supergenre will be "fiction" (such as in the case of "historical fiction").

In [16]:
#If the new_genre column contains "fiction" not preceded by "non" set the supergenre value to "fiction"
pattern = '(?<!non)fiction'
data_clean.loc[data_clean['new_genre'].str.contains(pattern),'supergenre'] = 'fiction'

#Display the book counts for the supergenre column
data_clean['supergenre'].value_counts()
Out[16]:
fiction        36480
no value       35060
non-fiction    28398
Name: supergenre, dtype: int64

Adding the fiction categorization reduced the number of non-fiction books from 29,479 to 28,398. So it appears that there are some books that have both "fiction" and "non-fiction" in the new_genre column. Let's take a closer look at what those are.

In [17]:
#Display books that contain both "fiction" and "nonfiction" in the new_genre column
pd.set_option('display.max_colwidth', None)
bool = (data_clean['new_genre'].str.contains('nonfiction')) & data_clean['new_genre'].str.contains('(?<!non)fiction')
data_clean[bool].head(10)
Out[17]:
title author bookformat isbn genre rating totalratings reviews new_genre supergenre
8091 The Art of War Sun Tzu,Thomas Cleary,Pulat Otkan paperback NaN nonfiction,classics,philosophy,history,war,business,politics,war,military fiction,cultural,china,leadership 3.97 360348 12237 nonfiction,classics,philosophy,history,war,business,politics,war,military fiction,cultural,china,leadership fiction
10159 1776 David McCullough paperback 743226720 history,nonfiction,north american hi...,american history,historical,military history,american revolution,war,audiobook,american revolution,american revolutionary war,politics,war,military fiction 4.06 201109 7468 history,nonfiction,north american hi...,american history,historical,military history,american revolution,war,audiobook,american revolution,american revolutionary war,politics,war,military fiction fiction
25441 The Hunger Games: Official Illustrated Movie Companion Kate Egan paperback 545422906 nonfiction,science fiction,dystopia,media tie in,adventure,culture,film,young adult,teen,romance,love,thriller,romance,m m romance 4.52 175110 412 nonfiction,science fiction,dystopia,media tie in,adventure,culture,film,young adult,teen,romance,love,thriller,romance,m m romance fiction
59863 Julie and Julia: 365 Days, 524 Recipes, 1 Tiny Apartment Kitchen Julie Powell hardcover 031610969X nonfiction,autobiography,memoir,food and drink,food,food and drink,cooking,biography,womens fiction,chick lit,biography memoir,humor,food and drink,cookbooks,food and drink,foodie 3.70 157546 7985 nonfiction,autobiography,memoir,food and drink,food,food and drink,cooking,biography,womens fiction,chick lit,biography memoir,humor,food and drink,cookbooks,food and drink,foodie fiction
27395 American Sniper: The Autobiography of the Most Lethal Sniper in U.S. Military History Chris Kyle,Scott McEwen,Jim DeFelice paperback 62107062 nonfiction,biography,war,military fiction,history,war,autobiography,memoir,biography,autobiography,audiobook,military,military history,biography memoir 4.00 116982 8126 nonfiction,biography,war,military fiction,history,war,autobiography,memoir,biography,autobiography,audiobook,military,military history,biography memoir fiction
6265 He's Just Not That Into You: The No-Excuses Truth to Understanding Guys Greg Behrendt,Liz Tuccillo hardcover 141694740X nonfiction,self help,womens fiction,chick lit,relationships,humor,psychology,romance,adult,contemporary,humor,comedy 3.66 64018 3470 nonfiction,self help,womens fiction,chick lit,relationships,humor,psychology,romance,adult,contemporary,humor,comedy fiction
584 Flags of Our Fathers James D. Bradley,Ron Powers paperback 553384155 history,nonfiction,war,military fiction,war,war,world war ii,biography,military,military history,north american hi...,american history,historical,adult 4.19 53110 1644 history,nonfiction,war,military fiction,war,war,world war ii,biography,military,military history,north american hi...,american history,historical,adult fiction
79186 No Easy Day: The Firsthand Account of the Mission That Killed Osama Bin Laden Mark Owen,Kevin Maurer hardcover 525953728 nonfiction,war,military fiction,history,war,biography,autobiography,memoir,military,military history,audiobook,politics,biography,autobiography 3.99 48049 3978 nonfiction,war,military fiction,history,war,biography,autobiography,memoir,military,military history,audiobook,politics,biography,autobiography fiction
24665 Physics of the Impossible Michio Kaku hardcover 385520697 science,nonfiction,science,physics,science,popular science,science,technology,science,astronomy,space,science fiction,time travel,audiobook,reference 4.07 35258 1358 science,nonfiction,science,physics,science,popular science,science,technology,science,astronomy,space,science fiction,time travel,audiobook,reference fiction
16161 The Looming Tower: Al-Qaeda and the Road to 9/11 Lawrence Wright hardcover 037541486X history,nonfiction,politics,war,terrorism,war,religion,north american hi...,american history,religion,islam,audiobook,war,military fiction 4.33 27328 2435 history,nonfiction,politics,war,terrorism,war,religion,north american hi...,american history,religion,islam,audiobook,war,military fiction fiction

We see that these books include biographies, military history, self-help and other topics we normally find in the non-fiction section, so we'll relabel these non-fiction.

In [18]:
#If the book contains both "fiction" and "nonfiction" in the new_genre column label it "non-fiction" in the supergenre column
data_clean.loc[bool, 'supergenre'] = 'non-fiction'
data_clean[bool].head(10)

#Display the book counts for the supergenre column
data_clean['supergenre'].value_counts()
Out[18]:
fiction        35399
no value       35060
non-fiction    29479
Name: supergenre, dtype: int64

We can see that for about a third of our books we don't know if it should be categorized as fiction or non-fiction. We'll have to see if we can categorize them based on other information in the genre column. First we need to transform the new_genre column from strings to lists. Then we'll explode the new genre column which will create one row for every genre in the list. Consequently, a single book can have multiple rows corresponding to each genre listed in its new genres column.

In [19]:
#Split the new_genre string into a list on the comma
data_clean['new_genre'] = data_clean['new_genre'].str.split(",")

#Create a new dataframe with the exploded genres, and show the first 10 rows
exploded_genre = data_clean.explode('new_genre')
exploded_genre.head(10)
Out[19]:
title author bookformat isbn genre rating totalratings reviews new_genre supergenre
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 young adult fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 romance fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 fiction fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 contemporary fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 realistic fiction fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 young adult fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 teen fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 young adult fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 coming of age fiction
27180 The Fault in Our Stars John Green hardcover NaN young adult,romance,fiction,contemporary,realistic fiction,young adult,teen,young adult,coming of age,novels,drama,love 4.19 3819326 158776 novels fiction

Let's verify that this worked by checking a single ISBN in the old and the new dataframe. In the old data we expect to see one row per ISBN.

In [20]:
pd.set_option("display.max_colwidth", None)
data_clean[data_clean['isbn'] == '002914180X']
Out[20]:
title author bookformat isbn genre rating totalratings reviews new_genre supergenre
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 [history, military history, civil war, american history, american civil war, nonfiction, north american hi..., american history, native americans] non-fiction

In the new dataframe we expect to see 9 rows, one for each genre in the new_genre column, which is what we're seeing.

In [21]:
exploded_genre[exploded_genre['isbn'] == '002914180X']
Out[21]:
title author bookformat isbn genre rating totalratings reviews new_genre supergenre
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 history non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 military history non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 civil war non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 american history non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 american civil war non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 nonfiction non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 north american hi... non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 american history non-fiction
0 Between Two Fires: American Indians in the Civil War Laurence M. Hauptman hardcover 002914180X history,military history,civil war,american history,american civil war,nonfiction,north american hi...,american history,native americans 3.52 33 5 native americans non-fiction

We can now quickly see if some books had duplicate genres and remove duplicate rows, so that each book, format, author and ISBN has only one row for each applicable genre.

In [22]:
#Check the number of rows in the exploded dataset
exploded_genre.shape
Out[22]:
(831056, 10)
In [23]:
#Remove duplicate rows where title, author, format, ISBN, and genre are all the same
exploded_genre[exploded_genre.duplicated(subset=['title', 'author', 'bookformat', 'isbn','new_genre'], keep=False)]
exploded_genre = exploded_genre.sort_values(by = ['totalratings'], ascending = False)
exploded_genre = exploded_genre.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn', 'new_genre'], keep='first')

#Check the number of rows that remain
exploded_genre.shape
Out[23]:
(678793, 10)

This step removed 152,263 rows. Now let's see how many fiction and non-fiction books we have, and how their average ratings compare, by aggregating data in our unexploded dataset (which has one entry per book).

In [24]:
#Find the counts, average rating, average ratings, and average reviews for each supergenre
supergenre_summary = data_clean.groupby('supergenre').agg(genre_count=('supergenre', 'count'), 
                               avg_rating=('rating', 'mean'),
                                               avg_ratings=('totalratings', 'mean'),
                                               avg_reviews=('reviews', 'mean'))
supergenre_summary.sort_values(by = 'genre_count', ascending=False)

#Format the numeric columns
supergenre_summary.style.format({'genre_count':'{:,.0f}',
                                'avg_rating':'{:,.2f}',
                                'avg_ratings':'{:,.0f}',
                                'avg_reviews':'{:,.0f}'})
Out[24]:
  genre_count avg_rating avg_ratings avg_reviews
supergenre        
fiction 35,399 3.85 7,059 415
no value 35,060 3.71 197 15
non-fiction 29,479 3.95 1,435 100

There are a lot of books with "no value" for the supergenre which means it didn't have a value of "fiction" or "non-fiction" in the new_genres column so we don't know how to classify it. Let's see if we can classify some of these by looking at what the most common genres are when the supergenre is "no value."

In [25]:
#View the most popular new_genre when the supergenre = no value
missing_genre_summary = exploded_genre[exploded_genre['supergenre'] == 'no value'].groupby('new_genre').agg(genre_count=('new_genre', 'count'), 
                               avg_rating=('rating', 'mean'),
                                               avg_ratings=('totalratings', 'mean'),
                                               avg_reviews=('reviews', 'mean'))
missing_genre_summary = missing_genre_summary.sort_values(by = 'genre_count', ascending=False)

#Display the top 30 results and format the numeric columns
missing_genre_summary.head(30).style.format({'genre_count':'{:,.0f}',
                                'avg_rating':'{:,.2f}',
                                'avg_ratings':'{:,.0f}',
                                'avg_reviews':'{:,.0f}'})
Out[25]:
  genre_count avg_rating avg_ratings avg_reviews
new_genre        
no genre 10,467 3.34 10 1
romance 4,569 3.79 1,137 81
fantasy 2,909 3.89 1,512 102
sequential art 2,280 3.92 422 16
history 2,035 3.90 46 5
cultural 1,921 3.88 143 12
childrens 1,813 3.79 155 19
paranormal 1,695 3.87 2,375 164
religion 1,617 4.06 70 8
science 1,464 3.91 63 7
comics 1,388 3.90 529 20
manga 1,359 4.01 565 18
european literature 1,340 3.80 145 9
contemporary 1,303 3.75 881 77
graphic novels 1,132 3.94 726 26
picture books 1,091 3.77 132 22
category romance 1,075 3.57 123 14
reference 1,065 4.01 74 4
harlequin 1,034 3.51 124 14
m m romance 1,017 3.74 536 57
young adult 998 3.97 2,704 172
games 965 3.84 56 2
lgbt 925 3.83 583 58
art 923 4.04 80 6
paranormal romance 898 3.91 3,773 259
poetry 884 4.01 180 15
contemporary romance 843 3.66 802 70
animals 727 3.89 459 37
philosophy 719 4.02 62 6
magic 703 3.94 2,958 194

We'll remove the "no genre" books from our dataset because we don't have a way to get their genre, they only averaged about 1 review per title. We're most interested in the books that have high reviews. We'll remove all children's and picture books from the dataset since our group won't be for children. Other genres we can verify by checking them here. Anything with a genre of fantasy, sequential art, comics, manga, graphic novels, we'll categorize as fiction. And if the genre contains romance we'll also categorize these as fiction, because the romance genres appear under fiction on the Goodreads genre list.

In [26]:
#Remove childrens and "no genre" books and categorize comics/graphic novels and fantasy as fiction
data_clean_genre = data_clean[data_clean['genre'].str.contains('no genre|childrens|picture books', regex=True)==False]
data_clean_genre.loc[data_clean['genre'].str.contains('romance|fantasy|sequential art|comics|manga|graphic novels', regex=True),'supergenre'] = 'fiction'

Now that we've categorized the books by supergenre as best we can and cleaned up the genre column we can move on to the final analysis and visualizations.

Data Summary: Fiction vs Non-Fiction¶

Let's visualize our cleaned-up dataset to see whether fiction or non-fiction books tend to get more engagement, which we'll quantify by looking at the number of reviews per title for each.

In [27]:
#Create a plotly box plot of reviews per title in our cleaned dataset
fig = px.box(data_clean_genre[data_clean_genre['supergenre'] != 'no value'], x="supergenre", y="reviews",
             range_y=[0,450],
            points='outliers',
            title='Fiction Books Get More Reviews Than Non-Fiction',)

fig.show()
fictionnon-fiction050100150200250300350400450
Fiction Books Get More Reviews Than Non-Fictionsupergenrereviews
plotly-logomark
  • The box plots show is that fiction books tend to get much higher numbers of reviews. The median for fiction is 46, while the median for non-fiction is 17. The 3rd quartile for fiction is 167 while for non-fiction it's 52.

  • This also shows us there are many outliers for both groups so when evaluating the genres and supergenres we'll use the median ratings and median reviews, because the median is less likely to be skewed by extreme outliers than the average.

Next we'll show the results in chart format, specifically the median number of ratings and reviews. For the 5-star rating we'll use the average (since the rating only goes up to 5) but if the book received fewer than 20 ratings we'll exclude it from the average star rating. We'll also calculate a new column for review ratio, which is the total number of reviews divided by the total number of ratings. This tells us, for users who rated a book, how likely were they to also leave a review.

In [28]:
#Add a column for review ratio
data_clean_genre_with_new_col = data_clean_genre.copy()
data_clean_genre_with_new_col['review_ratio'] = data_clean_genre_with_new_col['reviews'] / data_clean_genre_with_new_col['totalratings']
In [29]:
#Generate numerical chart grouped by supergenre with counts, median ratings, and median reviews
supergenre_summary = data_clean_genre_with_new_col.groupby('supergenre').agg(genre_count=('supergenre', 'count'), 
                                                                median_ratings=('totalratings', 'median'),
                                                               median_reviews=('reviews', 'median'),
                                                               median_review_ratio=('review_ratio', 'median'))
supergenre_summary.reset_index()

#Calculate the average star rating by supergenre excluding books with less than 20 ratings. 
data_clean_ratings = data_clean_genre_with_new_col[data_clean_genre_with_new_col['totalratings'] >= 20]
supergenre_summary_ratings = data_clean_ratings.groupby('supergenre').agg(avg_star_rating=('rating', 'mean'))
supergenre_summary = supergenre_summary.merge(supergenre_summary_ratings, on='supergenre', how='left')

#Format the numeric columns
format_dict = {'genre_count':'{:,.0f}'
,'median_ratings':'{:,.0f}'
,'median_reviews':'{:,.0f}'
,'avg_star_rating':'{:,.3f}'
,'median_review_ratio':'{:,.3f}'
}

#Reorder the columns, sort by median reviews and format the numeric columns
supergenre_summary = supergenre_summary[['genre_count', 
                                         'median_ratings', 
                                         'median_reviews', 
                                         'avg_star_rating',
                                         'median_review_ratio']]
supergenre_summary = supergenre_summary.sort_values('median_reviews', ascending=False).reset_index()
supergenre_summary.style.hide(axis="index").format(format_dict)
Out[29]:
supergenre genre_count median_ratings median_reviews avg_star_rating median_review_ratio
fiction 38,524 547 46 3.844 0.088
non-fiction 27,110 163 17 3.956 0.104
no value 15,974 23 2 3.949 0.083
  • The results support our initial hypothesis that fiction had more appeal for an online book group. Fiction has roughly 3x higher median ratings and reviews than non-fiction. The average star rating was only 3% lower than non-fiction so readers rated them similarly in terms of satisfaction.

  • The median review ratio shows that for readers of non-fiction, they may be 18% more likely to leave a review if they rated the book. This may be because non-fiction has smaller, more niche audiences that are a bit more engaged, but since fiction attracts a much bigger audience, we'll still recommend focusing on fiction books.

  • The numerical chart also shows that we've reduced the number of titles with "no value" for fiction or non-fiction from 35% to 20% of our dataset. These titles tend to have low ratings and reviews, so we'll ignore them for the time being.

Data Summary: Book Formats¶

Now we want to see whether the format of the book has any bearing on how successfully it attracts reviews. We'll focus on fiction books and on the four most popular formats (hardcover, ebook, paperback, and audiobook), summarize the results and then visualize the median reviews for each format.

In [30]:
#Calculate the counts, median reviews, median ratings, and median review ratio by format
supergenre_format_summary = data_clean_genre_with_new_col.groupby(['supergenre', 'bookformat']).agg(genre_count=('supergenre', 'count'), 
                                                                median_ratings=('totalratings', 'median'),
                                                               median_reviews=('reviews', 'median'),
                                                                median_review_ratio=('review_ratio', 'median'))
supergenre_format_summary.reset_index()

#Calculate the average star rating by supergenre excluding books with less than 20 ratings.
supergenre_format_summary_ratings = data_clean_ratings.groupby(['supergenre','bookformat']).agg(avg_star_rating=('rating', 'mean'))

#Merge the average star rating to the other data
supergenre_format_summary = supergenre_format_summary.merge(supergenre_format_summary_ratings, on=['supergenre', 'bookformat'], how='left')

#Reorder columns, sort by median reviews, and filter for popular fiction formats
supergenre_format_summary = supergenre_format_summary[['genre_count', 'median_ratings', 'median_reviews', 'avg_star_rating', 'median_review_ratio']]
supergenre_format_summary = supergenre_format_summary.sort_values(by='median_reviews', ascending=False).reset_index()
supergenre_format_summary = supergenre_format_summary[(supergenre_format_summary['bookformat'] != 'other') & (supergenre_format_summary['supergenre'] == 'fiction')]
supergenre_format_summary.style.hide(axis="index").format(format_dict)
Out[30]:
supergenre bookformat genre_count median_ratings median_reviews avg_star_rating median_review_ratio
fiction hardcover 6,686 1,101 114 3.823 0.105
fiction ebook 7,442 548 50 3.815 0.092
fiction paperback 22,780 482 40 3.856 0.083
fiction audiobook 290 272 24 3.773 0.092
In [32]:
#Create a bar chart of the median reviews from the format summary table
df = supergenre_format_summary
fig = px.bar(df, x="median_reviews", y="bookformat", orientation='h',
                 hover_data={'median_reviews':True,
                             'bookformat':False},
                 hover_name='bookformat',
                title='Hardcover Fiction Books Get More Reviews Than Other Formats',
                labels={'median_reviews': 'median reviews',
                       'bookformat': 'format'}
            )
fig.show()
020406080100120hardcoverebookpaperbackaudiobook
Hardcover Fiction Books Get More Reviews Than Other Formatsmedian reviewsformat
plotly-logomark
  • While paperbacks outnumber hardcovers 3 to 1, hardcover books have 3x higher median reviews and 2x higher ratings. Their review ratio is the highest of all types, meaning audiences that read and rate are also the most likely to leave a review as well.

  • The difference in average star rating was very slight, so readers were about as satisfied regardless of format.

  • These results may indicate that books tend to get higher reviews when they're new since publisher typically release books in hardcover first. They may speak to the higher quality of books that appear in hardcover format since publishers put their highest-priority titles in hardcover.

Data Summary: Top Fiction Genres¶

Now we want to examine whether there are specific genres of fiction books that tend to attract the greatest number of reviews. First we'll look at how many subgenres exist within fiction.

In [33]:
#Filter our cleaned dataset for fiction
data_clean_fiction = data_clean_genre_with_new_col[data_clean_genre['supergenre'] == 'fiction']

#Explode the new_genre column, allowing us to count the genres
exploded_data_clean_fiction = data_clean_fiction.explode('new_genre')
exploded_data_clean_fiction = exploded_data_clean_fiction.sort_values(by = ['totalratings'], ascending = False)

#Drop duplicate genres from the exploded dataset
exploded_data_clean_fiction = exploded_data_clean_fiction.drop_duplicates(subset=['title', 'author', 'bookformat', 'isbn', 'new_genre'], keep='first')

#Calculate the book counts, median ratings, median reviews, and median ratings ratio for each genre
genre_summary = exploded_data_clean_fiction.groupby('new_genre').agg(genre_count=('new_genre', 'count'), 
                                                        median_ratings=('totalratings', 'median'),
                                                        median_reviews=('reviews', 'median'),
                                                        median_review_ratio=('review_ratio', 'median'))
genre_summary.sort_values(by = 'genre_count', ascending=False).reset_index()
genre_summary = genre_summary[['genre_count', 'median_ratings', 'median_reviews', 'median_review_ratio']]

#Summarize the genre results
genre_summary.describe()
Out[33]:
genre_count median_ratings median_reviews median_review_ratio
count 959.000000 959.000000 959.000000 959.000000
mean 364.544317 872.965589 79.163191 0.103770
std 1373.989887 1413.337788 125.915635 0.049932
min 1.000000 16.000000 1.000000 0.008403
25% 5.500000 269.000000 26.000000 0.073956
50% 36.000000 572.000000 51.000000 0.096338
75% 200.000000 1060.750000 93.000000 0.123965
max 25103.000000 32453.000000 2901.000000 0.522013

We have 959 genres which is quite a lot, and some of these have as few as one title within them. We want to evaluate genres that have a large number of titles because a larger sample will give us more reliable results, and we also want to be sure there's a large number of potential titles to pick for our book group. To get a sense of how many titles we should be looking for we'll look at the books counts for the genres in the top percentiles.

In [34]:
#Show the book counts for the 75th-99th percentiles 
n = [75, 80, 85, 90, 95, 99]
for i in n:
    print("The " + str(i) + "th percentile book count is: ", round(np.percentile(genre_summary['genre_count'], i),0))
The 75th percentile book count is:  200.0
The 80th percentile book count is:  263.0
The 85th percentile book count is:  401.0
The 90th percentile book count is:  636.0
The 95th percentile book count is:  1562.0
The 99th percentile book count is:  5733.0

To simplify this dataset of 959 genres, we'll pick a round number between the 90th-95th percentile and say the genre needs to have at least 1,000 books. Next we'll calculate the average star rating per genre like we did before, by counting the book if it had at least 20 reviews.

In [35]:
#Calculate the average star rating for each genre, including books with at least 20 ratings
rating_summary = exploded_data_clean_fiction[exploded_data_clean_fiction['totalratings'] >= 20].groupby('new_genre').agg(avg_star_rating=('rating', 'mean'))
rating_summary = rating_summary.sort_values(by = 'avg_star_rating', ascending=False).reset_index()

#Merger the star rating with the other data
genre_summary = genre_summary.merge(rating_summary, on='new_genre', how='left')

#Reorder the columns
genre_summary = genre_summary[['new_genre', 'genre_count', 'median_ratings', 'median_reviews', 'avg_star_rating', 'median_review_ratio']]
In [36]:
#Filter the genres for only those with at least 1000 books in the dataset
genre_summary_top = genre_summary[genre_summary['genre_count'] >= 1000].sort_values(by='genre_count', ascending=False)
genre_summary_top.describe()
Out[36]:
genre_count median_ratings median_reviews avg_star_rating median_review_ratio
count 74.000000 74.000000 74.000000 74.000000 74.000000
mean 3437.783784 1136.939189 93.912162 3.865502 0.084631
std 3748.518221 760.327633 68.972019 0.086889 0.017961
min 1001.000000 132.000000 15.000000 3.597572 0.031645
25% 1342.000000 537.000000 49.250000 3.813099 0.074739
50% 2405.500000 985.000000 73.000000 3.854770 0.085250
75% 4083.750000 1501.000000 117.750000 3.917580 0.094691
max 25103.000000 3614.000000 314.000000 4.146836 0.120338

This narrows down the group to 74 genres, which is still a large number. We'll look at only the ones that are in the 75th percentile for ratings and reviews, and the 50th percentile for rating ratio, from this group. There isn't much variation in the star rating (the standard deviation is only .09) so we'll just focus on those three criteria.

We'll also remove some genres that aren't terribly meaningful from the result set like "novels," "adult," "audiobook," and "american."

In [37]:
#Further filter genre list for the top in median reviews, median ratings, and median ratings ratio
recommended = (genre_summary_top['median_reviews'] >= 118) & (genre_summary_top['median_ratings'] >= 1501) & (genre_summary_top['median_review_ratio'] >= .085) & (genre_summary_top['new_genre'] != 'novels') & (genre_summary_top['new_genre'] != 'audiobook') & (genre_summary_top['new_genre'] != 'adult') & (genre_summary_top['new_genre'] != 'american')
genre_summary_top_recommended = genre_summary_top[recommended].sort_values(by='median_reviews', ascending = False)

#Format the numeric columns
genre_summary_top_recommended.style.hide(axis="index").format(format_dict)
Out[37]:
new_genre genre_count median_ratings median_reviews avg_star_rating median_review_ratio
teen 1,001 2,719 314 3.792 0.120
womens fiction 2,068 3,243 278 3.814 0.085
mystery thriller 2,416 2,461 212 3.859 0.089
literary fiction 1,622 1,742 201 3.742 0.117
dystopia 1,173 1,896 198 3.846 0.108
suspense 2,395 2,351 190 3.901 0.085
thriller 3,873 2,019 171 3.849 0.088
young adult 5,037 1,849 153 3.899 0.092
crime 2,973 1,817 151 3.858 0.087
british literature 1,698 1,513 145 3.801 0.093
In [38]:
#Create a bubble chart of the top genres plotted with median reviews on x-axis, median ratings on y-axis, and book count for bubble size
df = genre_summary_top_recommended
fig = px.scatter(df, x="median_reviews", y="median_ratings",
                 color="new_genre",
                 size='genre_count', 
                 hover_data={'new_genre':False,
                             'median_reviews':True,
                             'median_ratings':True,
                             'genre_count':False},
                 hover_name='new_genre',
                range_x=[100,350],
                range_y=[100,3500],
                title='Top Fiction Genres For Median Reviews',
                labels={'median_ratings': 'median ratings',
                       'median_reviews': 'median reviews',
                       'new_genre': 'genre'},
                text='new_genre',
                size_max=40,
                opacity=.5)
fig.update_layout(
    font=dict(
        size=10
    )
)
fig.show()
teenwomens fictionmystery thrillerliterary fictiondystopiasuspensethrilleryoung adultcrimebritish literature100150200250300350500100015002000250030003500
genreteenwomens fictionmystery thrillerliterary fictiondystopiasuspensethrilleryoung adultcrimebritish literatureTop Fiction Genres For Median Reviewsmedian reviewsmedian ratings
plotly-logomark

Based on these results a few themes emerge:

  • Fiction books for teens aged 13 to 18 was a breakout fiction genre with the highest median reviews. It also had the highest review-to-rating ratio. But we want our group to be open to adult readers, so we'll rule this out.

  • Women's fiction was the next-highest fiction genre for median reviews and the highest for ratings. A book group for women's fiction is an intriguing choice, but we want our group to be open to all genders, so we'll rule this out.

  • Mystery thrillers was third for ratings and reviews. Close relatives "suspense" and "thriller" also appear on this list. This could be a good potential choice for our group.

  • Literary fiction--a broad category for books that aspire to literary merit--was fourth for median reviews and had the second-highest reviews-to-ratings ratio. This means that, though mystery thrillers may reach a larger audience of readers, the readers of literary fiction were about 30% more likely to leave a review, indicating these books can generate a lot of discussion.

We'll recommend targeting the book group at literary fiction readers since this is likely to appeal to a large audience, to generate a high amount of discussion, and will be inclusive of different age groups and genders.

Data Summary: Top Titles¶

Let's look at the top literary fiction books for reviews from this dataset to get a sense of which specific titles we could recommend for the book group.

In [46]:
#Format the numeric columns
titles_format_dict = {'totalratings':'{:,.0f}'
,'reviews':'{:,.0f}'
,'review_ratio':'{:,.3f}'
}

#Display the top literary fiction for adults sorted by total reviews
literary_fiction = (data_clean_genre_with_new_col['genre'].str.contains('literary fiction', regex=False)) & (~data_clean_genre_with_new_col['genre'].str.contains('young adult', regex=False))
data_clean_genre_with_new_col.loc[literary_fiction, ['title', 'author','bookformat','totalratings','reviews','review_ratio']].sort_values(by='reviews', ascending=False).head(10).style.hide(axis="index").format(titles_format_dict)
Out[46]:
title author bookformat totalratings reviews review_ratio
And the Mountains Echoed Khaled Hosseini hardcover 319,213 28,936 0.091
Americanah Chimamanda Ngozi Adichie,Jashar Awan hardcover 295,475 24,915 0.084
Life After Life Kate Atkinson hardcover 207,029 24,729 0.119
Norwegian Wood Haruki Murakami,Jay Rubin paperback 385,612 23,958 0.062
The Unlikely Pilgrimage of Harold Fry Rachel Joyce hardcover 157,045 18,179 0.116
The Sense of an Ending Julian Barnes hardcover 156,819 16,481 0.105
Rules of Civility Amor Towles hardcover 160,843 15,917 0.099
Beautiful Ruins Jess Walter hardcover 163,460 15,682 0.096
Stoner John Williams,John McGahern paperback 113,334 12,547 0.111
We Are All Completely Beside Ourselves Karen Joy Fowler hardcover 110,850 11,696 0.106

Almost all of these titles (with the exception of Norweigian Wood and Stoner) were hardcover editions published in the 2010s. Therefore we could focus our group on new literary fiction in hardcover editions, since we've seen that hardcovers tend to elicit more reviews. We could see if any of these authors of the top books have new books coming out. For example, Karen Joy Fowler published a new book in March 2022, Julian Barnes published a new book in August 2022, and Kate Atkinson published a new book in September 2022.